A Comparison of Statistical Models for the Extraction of Lexical Information from Text Corpora
Author
Abstract
The Syntagmatic Paradigmatic model (SP; Dennis & Harrington 2001, Dennis submitted) and the Pooled Adjacent Context model (PAC; Redington, Chater & Finch 1998) are compared on their ability to extract syntactic, semantic and associative information from a corpus of text. On a measure of syntactic class (and subclass) information based on the WordNet lexical database (Miller 1990), the models performed similarly, with a small advantage for the PAC model. On a measure of semantic structure based on the similarities produced by Latent Semantic Analysis (LSA; Landauer & Dumais 1997), the models performed equivalently, with a small advantage for the SP model. On a measure of associative information based on the free association norms of Nelson, McEvoy & Schreiber (1999), the SP model showed a substantive advantage over the PAC model, producing more than twice as many associates.
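The PAC model mentioned in the abstract builds distributional representations by pooling counts of the words that occur immediately adjacent to each target word. The paper's own implementation is not reproduced here; the Python sketch below only illustrates that general idea under assumed settings (the 1000 most frequent words as targets, the 150 most frequent as context features, and cosine similarity in place of the rank-correlation measure reported for the published PAC work), so the function names and parameter values are illustrative rather than taken from the paper.

from collections import Counter, defaultdict

def pac_vectors(tokens, n_targets=1000, n_contexts=150):
    """Pooled-adjacent-context sketch: for each frequent target word, count how
    often each frequent context word appears immediately before and after it."""
    freq = Counter(tokens)
    targets = {w for w, _ in freq.most_common(n_targets)}
    contexts = [w for w, _ in freq.most_common(n_contexts)]
    col = {w: i for i, w in enumerate(contexts)}
    dim = 2 * len(contexts)  # first half: preceding word; second half: following word
    vectors = defaultdict(lambda: [0] * dim)
    for i, w in enumerate(tokens):
        if w not in targets:
            continue
        if i > 0 and tokens[i - 1] in col:                 # word immediately to the left
            vectors[w][col[tokens[i - 1]]] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in col:   # word immediately to the right
            vectors[w][len(contexts) + col[tokens[i + 1]]] += 1
    return dict(vectors)

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Example (assuming `tokens` is a tokenised corpus given as a list of strings):
# vecs = pac_vectors(tokens)
# neighbours = sorted((w for w in vecs if w != "dog"),
#                     key=lambda w: cosine(vecs["dog"], vecs[w]), reverse=True)[:10]

Nearest neighbours under such vectors tend to share a part of speech, which is roughly the kind of structure the WordNet-based syntactic measure in the abstract assesses.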
Similar resources
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the different parts. The Persian language contains multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead of a half-space. This common incorrect use of the space character leads to some s...
Extraction of a parallel corpus from comparable documents to improve translation quality in machine translation systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but due to the scarcity of such texts, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to automatically create modules that would help speed up classification for any text categorization task. It also serves as a tool for...
Automatic extraction of property norm-like data from large text corpora
Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car--petrol). We propose a system for the challenging task of automatic, large-scale acquisition of uncons...
Towards a corpus-based dictionary of German noun-verb collocations
We describe our attempts to automatically extract raw material for a dictionary of German noun-verb collocations from large corpora of newspaper text. Such a dictionary should be about collocations and should include a description of their linguistic properties, rather than merely listing lexical cooccurrences. Since most statistical collocation finding tools do not provide other than lexic...
Journal:
Volume / Issue:
Pages: -
Publication date: 2003